Acta Psychiatrica Scandinavica
Wiley
Preprints posted in the last 90 days, ranked by how well they match Acta Psychiatrica Scandinavica's content profile, based on 10 papers previously published here. The average preprint has a 0.01% match score for this journal, so anything above that is already an above-average fit.
Varone, G.; Kumar, P.; Brown, J.; Boulila, W.
The assessment of psychiatric disorders is fundamentally challenged by symptom heterogeneity, high comorbidity, and the absence of objective biomarkers, which together result in substantial variability in clinical assessment and treatment selection. Patient-generated language captures rich information about subjective experience and symptom severity, which can be systematically encoded and analyzed using computational models, making it a scalable signal for psychiatric assessment. We compare two approaches: (i) a domain-specialized transformer fine-tuned on clinical language, based on the Bio-ClinicalBERT encoder architecture, and (ii) a large-scale instruction-tuned generalist encoder (Instructor-XL) used as a frozen feature extractor with a shallow classification head. A corpus of N = 151,228 de-identified texts was compiled from five public sources, covering four psychiatric phenotypes: anxiety, depression, schizophrenia, and suicidal intention. Models were evaluated using stratified 10-fold cross-validation with cost-sensitive training, prioritizing imbalance-aware metrics, including Macro-F1 and the Matthews Correlation Coefficient (MCC), over accuracy. Bio-ClinicalBERT achieved superior overall performance (Macro-F1 = 0.78, MCC = 0.6752), indicating more reliable separation of diagnostically overlapping affective categories. In contrast, Instructor-XL achieved its highest class-specific performance for schizophrenia (F1 = 0.798). Explainability analyses suggest that the domain-specialized model places greater weight on clinically relevant terms, whereas the generalist model relies on a broader set of lexical features.
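The imbalance-aware metrics named above are available in scikit-learn; a minimal sketch on toy four-class labels (illustrative values only, not the study's data):

```python
from sklearn.metrics import f1_score, matthews_corrcoef

# Toy labels for four classes (0 = anxiety, 1 = depression,
# 2 = schizophrenia, 3 = suicidal intention); illustrative only.
y_true = [0, 0, 1, 1, 2, 2, 3, 3]
y_pred = [0, 1, 1, 1, 2, 2, 3, 0]

macro_f1 = f1_score(y_true, y_pred, average="macro")  # unweighted mean of per-class F1
mcc = matthews_corrcoef(y_true, y_pred)               # robust to class imbalance
print(round(macro_f1, 3), round(mcc, 3))
```

Unlike accuracy, both scores penalize a model that neglects minority classes, which is why the abstract prioritizes them under class imbalance.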
Trivedi, S.; Simons, N. W.; Tyagi, A.; Ramaswamy, A.; Nadkarni, G. N.; Charney, A. W.
Background: Large language models (LLMs) are increasingly used in mental health contexts, yet their detection of suicidal ideation is inconsistent, raising patient safety concerns. Objective: To evaluate whether an independent safety monitoring system improves detection of suicide risk compared with native LLM safeguards. Methods: We conducted a cross-sectional evaluation using 224 paired suicide-related clinical vignettes presented in a single-turn format under two conditions (with and without structured clinical information). Native LLM safeguard responses were compared with an independent supervisory safety architecture with asynchronous monitoring. The primary outcome was detection of suicide risk requiring intervention. Results: The supervisory system detected suicide risk in 205 of 224 evaluations (91.5%) versus 41 of 224 (18.3%) for native LLM safeguards. Among 168 discordant evaluations, 166 favored the supervisory system and 2 favored the LLM (matched odds ratio ≈ 83.0). Both systems detected risk in 39 evaluations, and neither in 17. Detection was highest in scenarios with explicit suicidal ideation and lower in more ambiguous presentations. Conclusions: Native LLM safeguards frequently failed to detect suicide risk in this structured evaluation. An independent monitoring approach substantially improved detection, supporting the role of external safety systems in high-risk mental health applications of LLMs.
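The matched odds ratio reported above follows directly from the discordant counts, McNemar-style; a minimal check using the abstract's numbers:

```python
# Matched-pairs (McNemar-style) odds ratio: ratio of discordant counts.
# b = evaluations where only the supervisory system detected risk;
# c = evaluations where only the native LLM safeguard detected risk.
b, c = 166, 2
matched_or = b / c
print(matched_or)  # 83.0

# Sanity check: discordant + concordant cells cover all 224 evaluations.
assert b + c + 39 + 17 == 224
```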
Jin, K. W.; Rostam-Abadi, Y.; Chaudhary, P.; Garrett, M. A.; Huang, A. S.; Montelongo, M.; Nagpal, C.; Shei, J.; Weathers, J.; Zhang, J. S.; Chen, Q.; Kim, J.; Malgaroli, M.; Mathis, W. S.; Rodriguez, C. I.; Selek, S.; Sharma, M. S.; Pittenger, C.; Yip, S. W.; Zaboski, B. A.; Xu, H.
Importance: Large language models (LLMs) have demonstrated diagnostic potential in several medical specialties, but their application to psychiatry, where diagnosis relies heavily on clinical judgment, narrative interpretation, and reasoning under uncertainty, remains insufficiently evaluated. Objective: To evaluate diagnostic accuracy and clinician-judged reasoning quality of multiple large language models using psychiatric case vignettes. Design: Mixed-methods evaluation study of diagnostic accuracy across four LLMs using 196 psychiatric case vignettes (135 published and 61 novel). Clinical reasoning quality was evaluated on a randomly selected subset of 30 vignettes using structured clinician ratings along two reasoning dimensions. The highest-performing model was illustratively compared with psychiatry trainees on the same subset. Diagnostic correctness for the full vignette set was assessed by a separate adjudicator LLM. Setting: Publicly available model interfaces, December 2025. Participants: Five board-certified psychiatrists evaluated model-generated clinical reasoning. Two psychiatry residents served as the illustrative human comparison. Main Outcomes and Measures: Diagnostic accuracy and clinician-rated clinical reasoning quality. Diagnostic accuracy was assessed using top-1 accuracy, top-5 accuracy, recall@5, and mean reciprocal rank based on ranked lists of five differential diagnoses per vignette. Clinical reasoning quality was assessed using two 5-point Likert scales adapted from the Accreditation Council for Graduate Medical Education (ACGME) Psychiatry Residency Milestones, evaluating data extraction and diagnostic reasoning. Results: Across 196 psychiatric case vignettes, Claude Opus 4.5 (Anthropic) achieved the highest diagnostic accuracy (top-1 accuracy, 0.638; top-5 accuracy, 0.801; recall@5, 0.731; mean reciprocal rank, 0.710) and clinician-rated reasoning scores.
Higher clinician-rated diagnostic reasoning quality was strongly associated with diagnostic correctness in mixed-effects logistic regression analyses (β = 1.80; p < 0.001), corresponding to an approximately six-fold increase in the odds of a correct diagnosis per 1-point increase in reasoning score. In an illustrative comparison, the diagnostic accuracy of Claude Opus 4.5 fell within the range observed for psychiatry trainees. Conclusions and Relevance: LLMs demonstrated high diagnostic accuracy and generated clinical reasoning that clinicians judged to be largely coherent and safe. Diagnostic reasoning quality was more strongly associated with diagnostic correctness than data extraction quality, underscoring the importance of evaluating reasoning alongside accuracy when assessing LLMs for clinical decision support in psychiatry. Key Points: Question: Can multiple large language models accurately diagnose psychiatric conditions and generate diagnostic reasoning that clinicians judge as coherent, safe, and clinically meaningful? Findings: Across 196 psychiatric case vignettes, four large language models demonstrated high diagnostic accuracy. In a clinician-evaluated subset of 30 vignettes, model diagnostic accuracy fell within the range observed for psychiatry residents. Clinicians judged model-generated diagnostic reasoning to be largely coherent and safe. Higher clinician-rated reasoning quality was strongly associated with diagnostic correctness, independent of data extraction quality. Meaning: Evaluating diagnostic reasoning, in addition to accuracy, may be important when assessing large language models for potential clinical decision support in psychiatry.
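Of the ranked-list metrics above, mean reciprocal rank is the least familiar; a minimal sketch, assuming each vignette has a set of acceptable gold diagnoses (toy diagnoses, invented for illustration):

```python
def mrr(ranked_lists, gold):
    """Mean reciprocal rank of the first correct diagnosis in each ranked list
    (0 contribution if no correct diagnosis appears in the list)."""
    total = 0.0
    for ranking, answers in zip(ranked_lists, gold):
        rr = 0.0
        for rank, dx in enumerate(ranking, start=1):
            if dx in answers:
                rr = 1.0 / rank
                break
        total += rr
    return total / len(ranked_lists)

# Two toy vignettes, each with a ranked differential and a gold-standard set.
rankings = [["MDD", "GAD", "PTSD"], ["BD-I", "MDD", "schizophrenia"]]
gold = [{"GAD"}, {"schizophrenia"}]
print(mrr(rankings, gold))  # (1/2 + 1/3) / 2
```

Top-5 accuracy and recall@5 can diverge, as in the abstract, when a vignette has more than one acceptable diagnosis: top-5 accuracy needs any hit in the list, while recall@5 counts the fraction of gold diagnoses retrieved.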
Reinecke-Tellefsen, C. J.; Orberg, A.; Ostergaard, S. D.
The COVID-19 pandemic had a substantial impact on healthcare systems across the globe, including psychiatric services. Use of electroconvulsive therapy (ECT), a lifesaving intervention for severe mental illness, was reported to have declined during the pandemic in several countries, but nationwide data remain scarce. Using nationwide data from the Danish National Patient Register, we examined all ECT treatments administered in Denmark from September 2019 to May 2025. Weekly treatment numbers were visualized across the three national COVID-19 lockdowns to descriptively assess changes in ECT use. A notable reduction in ECT treatments was observed in the weeks preceding and during the first lockdown (March 11 to May 18, 2020). A post-hoc estimation indicated approximately 1,366 "missed" treatments during the initial pandemic phase in 2020. When these were added to the 27,033 treatments delivered in 2020, the adjusted total approximated annual treatment volumes in 2019 and 2022, suggesting a temporary disruption rather than a sustained decline. In contrast, ECT activity during the second and third lockdowns appeared largely unaffected. These findings suggest that ECT provision in Denmark was temporarily reduced during the initial phase of the pandemic but remained resilient thereafter. In the case of a future pandemic, safeguarding timely access to ECT, particularly in early phases, should be prioritized given its critical role in the treatment of severe mental illness.
Taosif, M.; Chaman, U. M.; Prova, N. A.; Taher, S. M.; Alam, M. G. R.; Rahman, R.
Mental health problems in adolescents are often inadequately evaluated because assessment methods rarely combine biological, behavioral, and demographic information. We therefore propose a twin-aware multimodal deep learning framework, applied to the QTAB dataset, for early prediction of adolescent anxiety disorders. We employ a 3D convolutional neural network for neuroimaging data and prototype-based learning modules with residual encoders for behavioral and phenotypic data. Each modality-specific encoder learns compact representations optimized for class-imbalanced prediction through multi-loss objective functions. Calibrated probability outputs from the three modules are combined via optimized weighted late fusion. The framework achieves an AUC of 0.8935 (95% CI: 0.792-0.969), an absolute gain of 11 percentage points over the best unimodal baseline (questionnaire: AUC = 0.7766), with a sensitivity of 85.7% and a specificity of 87.3%. Pairwise statistical testing indicated that the classification patterns of the fusion model differ significantly from the questionnaire-only baseline (McNemar p = 0.0008), though AUC differences did not reach statistical significance at this sample size (DeLong p > 0.05). The best fusion weights were 23% MRI, 63% questionnaire, and 14% phenotypic, highlighting the dominant role of behavioral data. These results demonstrate that calibrated late fusion of multimodal predictions provides robust performance for early adolescent anxiety screening in twin cohorts with family-aware evaluation protocols.
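The weighted late-fusion step can be sketched as a normalized weighted average of per-modality probabilities; the subject-level probabilities below are invented, and only the weights follow the abstract's reported 23/63/14 mix:

```python
import numpy as np

def late_fusion(probs, weights):
    """Weighted average of calibrated per-modality probabilities."""
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()  # normalize so weights sum to 1 (np.average also does this)
    return np.average(np.asarray(probs), axis=0, weights=w)

# Per-modality anxiety probabilities for three hypothetical subjects.
p_mri   = [0.40, 0.70, 0.20]
p_quest = [0.80, 0.65, 0.10]
p_pheno = [0.55, 0.50, 0.30]
fused = late_fusion([p_mri, p_quest, p_pheno], [0.23, 0.63, 0.14])
print(fused.round(3))
```

Calibrating each module before fusing matters because the weighted average is only meaningful when the three probability scales are comparable.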
Sivak, L.; Forsman, J.; Sariaslan, A.; Tiihonen, J.; Fazel, S.
Background: Forensic psychiatric services are expanding in many countries, and discharging patients from secure hospitals relies on accurate estimates of the risk of adverse outcomes. Novel evidence-based tools for estimating one key risk, violent reoffending, have been developed in recent years. We aimed to externally validate one new tool, FoVOx, in forensic psychiatric patients sentenced to treatment, and to develop an updated model (FoVOx2) incorporating additional clinical predictors. Methods: Using Swedish national registers, we conducted a temporal external validation of FoVOx by examining 767 patients discharged between 2014 and 2023. For the FoVOx2 cohort, 906 patients discharged between 2008 and 2023 were followed up, and additional predictors were tested. The outcome was violent reconviction within 12 or 24 months. Model performance was evaluated using Harrell's C-index, time-dependent AUCs, calibration, and classification metrics at predefined thresholds. Results: In temporal validation, FoVOx showed moderate discrimination (AUCs 0.69 and 0.71; C-index = 0.69) and acceptable overall accuracy (Brier < 0.11). Calibration was generally good, with mild overestimation at the highest predicted risks (>20%) at 12 months and slight underprediction at 24 months. The updated FoVOx2 model newly incorporated clozapine treatment and additional diagnostic categories. It was associated with improved performance (AUCs 0.77; optimism-corrected C-index = 0.72; Brier 0.06 and 0.09) and achieved good calibration (intercept ≈ 0; slopes 1.03 and 1.05). Conclusions: Updating risk assessment tools with additional clinical factors can lead to incremental improvements in model performance. Implementation of such tools should consider clinical utility and impact as next steps.
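Harrell's C-index used above can be sketched, in a simplified form that ignores tied follow-up times, as a pairwise concordance count over survival data (all values invented):

```python
from itertools import combinations

def concordance_index(risk, time, event):
    """Fraction of comparable pairs where the higher predicted risk has the
    earlier observed event (ties in risk count 0.5). Simplified: a pair is
    comparable only when the shorter follow-up ends in an event."""
    concordant = comparable = 0.0
    for i, j in combinations(range(len(risk)), 2):
        # order so subject a has the shorter follow-up
        a, b = (i, j) if time[i] < time[j] else (j, i)
        if not event[a]:
            continue  # censored first -> pair not comparable
        comparable += 1
        if risk[a] > risk[b]:
            concordant += 1
        elif risk[a] == risk[b]:
            concordant += 0.5
    return concordant / comparable

risk  = [0.9, 0.4, 0.3, 0.2]   # predicted risk of violent reconviction
time  = [3, 12, 6, 24]         # months to event or censoring
event = [1, 0, 1, 0]           # 1 = reconvicted, 0 = censored
print(concordance_index(risk, time, event))  # 0.8
```

A C-index of 0.5 is chance-level ranking; the 0.69-0.72 range reported above indicates moderate discrimination.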
Flathers, M.; Nguyen, P. A. H.; Herpertz, J.; Granof, M.; Ryan, S. J.; Wentworth, L.; Moutier, C. Y.; Torous, J.
Background: Millions of people use language models to discuss mental health concerns, including suicidal ideation, but limited frameworks exist for evaluating whether these systems respond safely. Benchmarking, the practice of administering standardized assessments to language models, offers direct parallels to clinical competency evaluation, yet few clinicians are involved in designing, validating, or interpreting these assessments. Aims: To introduce mental health professionals to benchmarking language models by administering a validated clinical instrument and demonstrating how configuration decisions, measurement limitations, and scoring context affect result interpretation. Method: We administered the Suicide Intervention Response Inventory (SIRI-2) programmatically to nine commercially available language models from three providers. Each item was presented 60 times per model (three prompt variants × two temperature settings × 10 repetitions), yielding 27,000 model responses compared against point-in-time expert consensus. Results: Total scores ranged from 19.5 to 84.0 (expert panel baseline: 32.5). Prompt design alone shifted individual model scores by as much as the difference between trained and untrained human groups. The best-performing model approached the instrument's measurement floor. All nine models consistently overrated clinically inappropriate responses that sounded supportive. Conclusions: A single benchmark score can support markedly different claims depending on the assumed standard of clinical behavior, the instrument's remaining measurement range, and the configuration that produced the result. The skills required to make these distinctions must become core competencies. Benchmark results are increasingly used to support claims about mental health safety that may not be accurate, making it necessary to close the gap between clinical measurement and AI evaluation.
Plain Language Summary: AI chatbots like ChatGPT, Claude, and Gemini are increasingly used by millions of people to discuss mental health problems, including thoughts of suicide. To assess whether these systems handle such conversations safely, researchers give them standardized tests called benchmarks and compare their answers to those of human experts. These scores are already used to argue AI systems are ready for clinical use. This study gave a well-established test of suicide response skills to nine AI models from three major companies under varying conditions. We changed how much instruction the AI received and how much randomness was built into its responses, then measured whether the scores changed. The same AI model could score like a trained crisis counselor under one set of conditions and like an untrained undergraduate under another, depending on choices made by the person running the test. Every model also made the same kind of mistake: responses that sounded warm and caring were rated as appropriate, even when experts had judged them to be clinically problematic. The highest-scoring model performed so well that the test could no longer measure whether it was truly skilled or had simply exceeded the test's range. These findings show that a single score can be misleading without knowing how the test was run, whether it can still distinguish strong from weak performance, and whether it matches what the AI is used for. Mental health professionals routinely make these judgments about clinical assessments and are well positioned to bring that expertise to AI evaluation.
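The 3 × 2 × 10 evaluation grid described above is straightforward to enumerate; the variant names and temperature values below are hypothetical stand-ins, since the abstract does not specify them:

```python
from itertools import product

prompt_variants = ["minimal", "clinician_role", "detailed_rubric"]  # hypothetical labels
temperatures = [0.0, 1.0]                                           # hypothetical values
repetitions = range(10)

# Every (prompt, temperature, repetition) combination for one SIRI-2 item.
runs = list(product(prompt_variants, temperatures, repetitions))
print(len(runs))  # 60 runs per item, matching the abstract's 3 x 2 x 10 design
```

Logging the full configuration tuple alongside each score is what makes it possible to attribute score shifts to prompt design rather than model ability, the paper's central methodological point.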
Kasyanov, E. D.; Mazo, G. E.
Background: Lithium is one of the key medications for the treatment of bipolar disorder, but it requires therapeutic drug monitoring because of its narrow therapeutic window. In routine clinical practice, blood sampling is often performed outside the recommended 10-14 hour interval after the last evening dose, which distorts interpretation of the measured concentration (overestimation with early sampling and underestimation with late sampling) and may lead to inappropriate dose adjustment. Objective: To develop and validate, using synthetic data, a multiplicative model (SimpLi) that standardizes a measured lithium concentration to the 12-hour level while accounting for sampling time and daily dose. Materials and Methods: A simulation study was conducted in accordance with ADEMP recommendations. A synthetic cross-sectional dataset (n = 1000) was generated with distributions of time since the last lithium dose, serum concentrations, and doses derived from the Bipolar CHOICE study, with a median sampling time of 12 hours (IQR 11-14) and a time-concentration correlation of r ≈ -0.30. The dataset was split 70/30 with stratification by time intervals, and 5-fold cross-validation was performed. Model performance was evaluated using RMSE, MAE, and R². Results: The simulation closely reproduced the prespecified time distribution, achieved the target time-concentration correlation (r ≈ -0.30), and yielded a clinically plausible dose structure. A model using time as the only predictor showed limited accuracy (RMSE = 0.316; R² = 0.108), while adding dose provided a moderate improvement (RMSE = 0.303; R² = 0.177). When sampling occurred exactly at 12 hours, direct prediction was biased (-0.150; RMSE = 0.357), supporting the need for an individual correction factor. In a proof-of-concept analysis of five clinical cases, SimpLi produced a lower MAE than the eLi12 formula (0.042 vs 0.056 mEq/L).
Conclusions: SimpLi is a practical tool (psyandneuro.ru/bekhterev-ai/simpli/) for standardizing lithium levels to 12 hours when sampling times vary. External validation on real-world data and robustness testing across clinical scenarios are needed.
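The abstract does not disclose the SimpLi coefficients; purely as an illustration, a multiplicative standardization under an assumed one-compartment, first-order elimination model (with a hypothetical half-life) could look like this:

```python
import math

def standardize_to_12h(measured, hours_since_dose, half_life_h=24.0):
    """Project a lithium level measured at `hours_since_dose` onto the 12-h value,
    assuming mono-exponential decay with an assumed elimination half-life.
    Illustrative only -- these are NOT the published SimpLi coefficients."""
    k = math.log(2) / half_life_h                      # elimination rate constant
    return measured * math.exp(-k * (12.0 - hours_since_dose))

# A level drawn early (8 h post-dose) overestimates the 12-h value;
# the multiplicative correction scales it down accordingly.
corrected = standardize_to_12h(0.80, 8.0)
print(round(corrected, 3))
```

The direction of the correction matches the abstract's point: early sampling yields overestimates (correction factor < 1), late sampling yields underestimates (factor > 1).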
Shi, Z.; Youngstrom, E. A.; Liu, Y.; Youngstrom, J. K.; Findling, R. L.
Pediatric bipolar disorder is challenging to diagnose accurately due to symptom heterogeneity. More standardized, data-driven approaches are needed to enhance diagnostic reliability. We evaluated a clinical decision tool (nomogram), statistical methods (logistic regression, LASSO), machine learning models (support vector machine, random forest, k-nearest neighbors, extreme gradient boosting), and a deep learning model (multilayer perceptron) for pediatric bipolar disorder prediction across two datasets collected in academic (N = 550) and community (N = 511) clinical settings. We compared three modeling strategies: cross-dataset validation, cross-dataset validation with interaction terms, and mixed-dataset training. We assessed model performance using discrimination ability, calibration, and predictor importance ranking. In the baseline cross-dataset approach, all models showed good internal discrimination in the academic dataset, but external discrimination in the community dataset declined substantially. Interaction-enhanced models slightly improved internal discrimination but not external performance or calibration. Recalibration markedly improved cross-dataset calibration without compromising discrimination, indicating that transportability problems were largely driven by probability scaling. Models trained on mixed datasets exhibited much stronger external discrimination and calibration. Across models and training strategies, family risk and the PGBI-10M were consistently ranked as the most important predictors. Predictive models for pediatric bipolar disorder showed strong internal performance but limited cross-setting generalizability due to dataset shift and miscalibration. Increasing model complexity did not improve external performance, whereas training on pooled data substantially improved both discrimination and calibration.
Findings suggest that sampling diversity, rather than model complexity, is more valuable for developing clinically useful and generalizable psychiatric prediction models, underscoring the importance of open and collaborative datasets.
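The recalibration step credited with fixing cross-setting calibration is typically a logistic (Platt-style) refit of intercept and slope on the logit of the original predictions; a minimal sketch with invented validation data, not the study's models:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def recalibrate(p_orig, y, p_new):
    """Logistic recalibration: learn a new intercept and slope on the logit of
    held-out predictions, then rescale fresh predictions. Reranks nothing, so
    discrimination is preserved while probability scaling is corrected."""
    logit = lambda p: np.log(np.asarray(p) / (1 - np.asarray(p)))
    lr = LogisticRegression().fit(logit(p_orig).reshape(-1, 1), y)
    return lr.predict_proba(logit(p_new).reshape(-1, 1))[:, 1]

# Toy scenario: the original model systematically overestimates risk
# in the new setting (dataset shift).
p_val = [0.2, 0.3, 0.6, 0.7, 0.8, 0.9, 0.85, 0.75, 0.35, 0.25]
y_val = [0,   0,   0,   1,   1,   1,   0,    1,    0,    0]
rescaled = recalibrate(p_val, y_val, [0.5, 0.9])
print(rescaled.round(3))
```

Because the transform is monotone in the original logit (for a positive slope), AUC is unchanged, which matches the abstract's observation that recalibration improved calibration "without compromising discrimination."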
Provaznikova, B.; de Bardeci, M.; Altamiranda, E.; Ip, C.-T.; Monn, A.; Weber, S.; Jungwirth, J.; Rohde, J.; Prinz, S.; Kronenberg, G.; Bruehl, A.; Bracht, T.; Olbrich, S.
Objective: Major depressive episodes frequently show limited response to first-line treatments, motivating the search for objective biomarkers. EEG/ECG-based support tools aggregating electrophysiological predictors may guide treatment selection. We examined whether antidepressant treatments concordant with an EEG/ECG-biomarker report were associated with higher response rates. Methods: We retrospectively analyzed adults with ICD-10 depressive disorder or bipolar depression treated with electroconvulsive therapy (ECT), repetitive transcranial magnetic stimulation (rTMS), (es)ketamine, or selective serotonin reuptake inhibitors (SSRIs) between 2022 and 2024. Resting-state EEG with simultaneous ECG generated individualized biomarker reports with modality-specific response likelihoods. Treatment chosen by clinical teams was classified as concordant or non-concordant; response was derived from routinely collected clinical scales. Results: Among 153 patients (ECT n=53, rTMS n=48, (es)ketamine n=36, SSRIs n=16), response rates were higher for concordant vs non-concordant treatments: ECT 70% vs 50%, rTMS 30% vs 13%, (es)ketamine 31% vs 10%, and SSRIs 100% vs 11%. Overall, 46% (42/92) of concordant vs. 26% (14/54) of non-concordant patients responded (absolute difference +20 percentage points; relative increase ≈77%; number needed to treat ≈5). Conclusion: Concordance with EEG/ECG biomarkers correlated with higher treatment response, warranting confirmation in prospective trials. Significance: EEG/ECG-based decision support may enhance antidepressant treatment response in everyday clinical practice.
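The number needed to treat quoted above is just the reciprocal of the absolute risk difference; a one-line check with the abstract's counts:

```python
def nnt(rate_treated, rate_control):
    """Number needed to treat: reciprocal of the absolute risk difference."""
    return 1.0 / (rate_treated - rate_control)

# Concordant vs non-concordant response rates from the abstract.
concordant, non_concordant = 42 / 92, 14 / 54
print(round(nnt(concordant, non_concordant)))  # 5
```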
Donegan, M. L.; Srivastava, A.; Peake, E.; Swirbul, M.; Ungashe, A.; Rodio, M. J.; Tal, N.; Margolin, G.; Benders-Hadi, N.; Padmanabhan, A.
The goal of this work was to leverage a large corpus of text-based psychotherapy data to create novel machine learning algorithms that can identify suicide risk in asynchronous text therapy. Advances in natural language processing and machine learning have allowed us to include novel data sources and to use encoding models that represent context. Our models use advanced natural language processing techniques, including fine-tuned transformer models such as RoBERTa, to classify risk. Subsequent model versions incorporated non-text data, such as demographic features and census-derived social determinants of health, to improve equitable and culturally responsive risk assessment, as well as multiclass models that identify tiered levels of risk. All new models demonstrated significant improvements over our previous model. Our final version, a multiclass model, provides a tiered system that classifies risk as "no risk," "moderate," or "severe" (weighted F1 of 0.85). This tiered approach enhances clinical utility by allowing providers to quickly prioritize the most urgent cases, ensuring more accurate and timely intervention for clients in need.
Kizilaslan, B.; Mehlum, L.
Purpose: Suicide and self-harm are major public health concerns characterized by substantial clinical and psychosocial heterogeneity. While latent class analysis has been used to identify subgroups of people with suicidal behavior, the extent to which such population-level phenotyping complements explainable artificial intelligence-based classification models remains unclear. Methods: We applied latent class analysis to a cross-sectional, publicly available dataset of 1,000 individuals presenting with self-harm and suicide-related behaviors at Colombo South Teaching Hospital, Kalubowila, Sri Lanka. Sociodemographic, psychosocial, and clinical variables were used to identify latent subgroups. Class characteristics and suicide prevalence were examined and compared with variable importance patterns reported in a previously published explainable artificial intelligence (XAI)-based suicide classification study using the same dataset. Results: Four latent classes were identified. Two classes exhibited very high suicide prevalence (91.2% [95% CI: 87.7-93.8] and 99.0% [95% CI: 96.4-99.7]), whereas two classes showed low prevalence (<1%). The two high-prevalence classes differed markedly in lifetime psychiatric hospitalization history, with one class showing a 100% prevalence of prior hospitalization and the other substantially lower hospitalization rates. These patterns partially aligned with, and extended beyond, variable importance findings from the XAI-based model. Conclusion: Latent class analysis identified distinct subgroups with substantially different suicide prevalence and clinical profiles, underscoring the heterogeneity of individuals presenting with self-harm. Comparison with the XAI-based suicide classification findings suggests that unsupervised phenotyping and supervised classification provide complementary perspectives, offering population-level context that may enhance the interpretability of suicide assessment frameworks.
Keywords: suicide; self-harm; latent class analysis; explainable artificial intelligence; machine learning
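Latent class analysis proper models categorical indicators and is usually fit with specialized software; purely as a rough continuous analogue, a finite mixture model illustrates the idea of recovering latent subgroups from multivariate profiles (all values below are simulated and invented):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Simulate two well-separated subgroups on two hypothetical profile scores.
rng = np.random.default_rng(0)
low_risk = rng.normal(loc=[0.0, 0.0], scale=0.3, size=(50, 2))
high_risk = rng.normal(loc=[3.0, 3.0], scale=0.3, size=(50, 2))
X = np.vstack([low_risk, high_risk])

# Fit a 2-component mixture and assign each subject a latent class.
gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
labels = gmm.predict(X)
print(len(set(labels[:50])), len(set(labels[50:])))
```

As in LCA, the number of components is a modeling choice, typically selected by information criteria (e.g. BIC) rather than fixed in advance.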
Cudic, M.; Meyerson, W. U.; Wang, B.; Yin, Q.; Khadse, P. N.; Burke, T.; Kennedy, C. J.; Smoller, J. W.
Background: Longitudinal measurement of depression severity in outpatient psychiatric care is limited by infrequent standardized assessments. Although psychiatric clinical notes capture illness burden and functional impairment, this information is rarely quantified for analysis. Objective: To evaluate whether large language models (LLMs) can infer clinically meaningful measures of depression severity from outpatient psychiatry notes. Methods: We sampled 91,651 outpatient psychiatry notes from 8,287 adult patients across 58 clinics within a large academic medical center between 2015 and 2021. A HIPAA-compliant LLM (OpenAI GPT-5.2) was prompted to independently estimate three depression severity scores (Patient Health Questionnaire-9 [PHQ-9], Hamilton Depression Rating Scale [HAM-D], and depression-specific Clinical Global Impression-Severity [CGI-S]) from notes, with patient-reported PHQ-9 content within notes redacted to prevent bias. Convergent validity was assessed against patient-reported PHQ-9 (n = 3,757), study-clinician chart review (n = 125), and treating-clinician suicide risk assessments (SRA; n = 2,985). Predictive validity was evaluated using survival models of antidepressant switching and psychiatric emergency visits. Discriminant validity across diagnoses and consistency across demographic groups and clinics were also evaluated. Results: Only 10.8% of eligible visits had a PHQ-9 recorded within 7 days before the encounter. LLM-inferred PHQ-9 scores showed moderate agreement with patient-reported PHQ-9 (Cohen's κ = 0.64, 95% CI: 0.62-0.66; Pearson r = 0.67, 95% CI: 0.65-0.68). Stronger agreement was found between LLM CGI-S and study-clinician chart review (κ for rater 1 = 0.79, 95% CI: 0.70-0.85; κ for rater 2 = 0.67, 95% CI: 0.58-0.77; r = 0.86 with the mean rating, 95% CI: 0.80-0.90).
In prospective analyses, LLM CGI-S predicted antidepressant switching (C-index = 0.60; 95% CI: 0.58-0.62) and psychiatric emergency visits (C-index = 0.63; 95% CI: 0.57-0.68), comparable to the predictive performance of patient-reported PHQ-9 and treating-clinician SRA. Correlations between LLM CGI-S and patient-reported PHQ-9 were consistent across clinics (I² < 0.1) but significantly lower among Black (r = 0.48, 95% CI: 0.38-0.57) and Hispanic (r = 0.43, 95% CI: 0.27-0.56) patients. Conclusions: LLM-inferred depression severity scores from psychiatric outpatient notes support longitudinal, standardized phenotyping of depression severity, such as for routine outcome monitoring. These results have implications for facilitating genetic, pharmacoepidemiologic, and antidepressant treatment effectiveness studies using real-world evidence.
Zhu, T.; Tashevski, A.; Taquet, M.; Azis, M.; Jani, T.; Broome, M. R.; Kabir, T.; Minichino, A.; Murray, G. K.; Nour, M. M.; Singh, I.; Fusar-Poli, P.; Nevado-Holgado, A.; McGuire, P.; Oliver, D.
Psychosis prevention relies on early detection of individuals at clinical high risk for psychosis (CHR-P), but detection remains limited, constraining preventive care. The effectiveness of the CHR-P paradigm is constrained in part because clinical assessments require specialist interpretation of narrative interviews, limiting scalability. Here, we evaluate whether large language models (LLMs; deep learning models trained on large text corpora to process and generate language) can extract clinically meaningful information from such interviews to support psychosis risk assessment. We assessed 11 open-weight LLMs on 678 PSYCHS interview transcripts from 373 participants (77.7% CHR-P). Models inferred CHR-P status and estimated severity and frequency across 15 symptom domains, benchmarked against researcher-rated scores. Larger models achieved the strongest classification performance (Llama-3.3-70B: accuracy = 0.80, sensitivity = 0.93, specificity = 0.58). LLM-generated symptom scores showed good agreement with researcher-rated scores (ICC for severity = 0.74, ICC for frequency = 0.75). Performance disparities were minimal across most demographic groups but varied across sites. Generated summaries were largely faithful to source transcripts, with low rates of clinically relevant confabulation (3%). Errors primarily reflected over-pathologisation of non-clinical experiences. While accuracy scaled with model size, smaller models achieved competitive performance at substantially lower computational cost. These findings demonstrate that open-weight LLMs can assess psychosis risk from clinical interview transcripts, supporting scalable, human-in-the-loop approaches to early detection.
Rohde, C.; Ostergaard, S. D.
ObjectivesElectroconvulsive Therapy (ECT) is an effective treatment for bipolar disorder, particularly in severe acute cases or for illness resistant to pharmacotherapy. However, the risk of relapse following ECT is high, necessitating intervention to reduce this risk. Based on findings from ECT studies in unipolar depression and its well-known mood-stabilizing properties, it is likely that lithium treatment may reduce the risk of relapse of bipolar disorder following ECT. Therefore, we conducted a target trial emulation using data from Danish nationwide registers to investigate whether lithium protects against relapse following ECT treatment of bipolar disorder. MethodsPatients discharged from their first psychiatric admission with a primary diagnosis of bipolar disorder between January 1, 2006, and June 1, 2024, who received at least six ECT treatments, were included. Follow-up began two weeks after discharge and continued until relapse, death, one year, or January 1, 2025. Patients were considered allocated to lithium treatment if they redeemed a prescription for lithium within the first two weeks after discharge from the index admission (ECT treatment). The outcome was time to relapse, defined by either psychiatric hospital admission or suicide. Cox proportional hazards regression, adjusted for potential confounders, was used to compare the outcome between patients allocated and not allocated to lithium treatment. ResultsAmong the 574 eligible patients (mean age 41.5 years, 61.3% women), 214 (37.3%) were allocated to lithium treatment and 360 (62.7%) were not allocated to lithium treatment. During follow-up, 56 patients (26.2%) in the lithium group and 135 patients (37.5%) in the non-lithium group experienced a relapse. Lithium treatment was associated with a substantially reduced risk of relapse (adjusted hazard rate ratio, 0.60, 95% CI=0.43-0.84). ConclusionLithium treatment after ECT may reduce the risk of relapse in patients with bipolar disorder. 
These findings should be followed up by a randomized controlled trial.
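The Cox model used above can be illustrated with a minimal sketch: a single binary covariate (lithium allocation), risk sets over follow-up time, and Newton-Raphson on the partial likelihood. This is a toy, unadjusted illustration with hypothetical data, not the registry analysis, which adjusted for potential confounders.

```python
import math

def cox_hr_binary(times, events, group):
    """Fit a one-covariate Cox proportional hazards model by Newton-Raphson.

    times  : follow-up time for each patient
    events : 1 if relapse observed, 0 if censored
    group  : 1 if allocated to lithium, 0 otherwise
    Returns the estimated hazard ratio exp(beta).
    """
    n = len(times)
    beta = 0.0
    for _ in range(50):
        score, info = 0.0, 0.0
        for i in range(n):
            if events[i] != 1:
                continue
            # risk set: everyone still under follow-up at time t_i
            s0 = s1 = 0.0
            for j in range(n):
                if times[j] >= times[i]:
                    w = math.exp(beta * group[j])
                    s0 += w
                    s1 += w * group[j]  # group is 0/1, so x^2 == x
            mean = s1 / s0
            score += group[i] - mean        # gradient of log partial likelihood
            info += mean - mean * mean      # observed information (binary covariate)
        if info == 0:
            break
        step = score / info
        beta += step
        if abs(step) < 1e-10:
            break
    return math.exp(beta)
```

With fewer and later relapses in the treated group, the estimated hazard ratio falls below 1, mirroring the direction of the reported association.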
Pruin, E.; Milaneschi, Y.; Bartels, M.; Bassani, P.; Penninx, B. W.; Peyrot, W. J.
Background
Genetic liability to depressive disorder can be captured by psychopathology in relatives (family history). Various methods summarize family history in a single score, differing in the information included as well as the underlying model. We systematically compared the performance of family history indicators, including promising new indicators based on the liability threshold model, in predicting depressive disorder.
Methods
We calculated selected family history indicators for depression (dichotomous, proportion, and the novel genetically informed method PAFGRS) in 1339 participants of the Netherlands Study of Depression and Anxiety (Ncase = 1086). Polygenic scores (PGS) were computed from the most recent GWAS for major depression. We assessed correlations between genetic liability indicators, as well as their prediction of lifetime depressive disorder diagnosis.
Results
Correlations of family history indicators with each other were high (r = 0.71-0.99), and much lower with the PGS (r = 0.15). Predictive accuracy tended to increase for more elaborately computed scores, ranging from proportion (AUC = 0.66, OR = 2.26, 95% CI = 1.88-2.71) to PAFGRS (AUC = 0.70, OR = 17.06, 95% CI = 9.46-30.77). The best-performing family history indicator and the PGS were independently associated with depressive disorder (PAFGRS: OR = 15.17, 95% CI = 8.36-27.51, p = 3.59×10⁻¹⁹; PGS: OR = 1.30, 95% CI = 1.12-1.50, p = 0.0004).
Conclusions
Our analysis shows that more elaborate family history indicators, which incorporate family size, prevalence, and heritability and are grounded in genetic theory, are preferable over simpler methods. Family history and PGS were complementary in prediction, showing the added value of including both in future studies.
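The two simpler indicators compared above, and the rank-based AUC used to score prediction, can be sketched in a few lines. PAFGRS itself, which models relatives' liabilities under the liability threshold model, is substantially more involved and is not reproduced here; the relative lists below are hypothetical.

```python
def fh_dichotomous(affected):
    """1 if any first-degree relative is affected, else 0."""
    return int(any(affected))

def fh_proportion(affected):
    """Share of first-degree relatives affected (accounts for family size)."""
    return sum(affected) / len(affected) if affected else 0.0

def auc(scores, labels):
    """Rank-based AUC (Mann-Whitney): P(score_case > score_control), ties count 0.5."""
    cases = [s for s, y in zip(scores, labels) if y == 1]
    controls = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if c > k else 0.5 if c == k else 0.0
               for c in cases for k in controls)
    return wins / (len(cases) * len(controls))
```

The dichotomous indicator discards family size and prevalence, which is exactly the information the proportion and PAFGRS scores progressively recover.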
Bamberger, R.; Kuhles, G.; Lotter, L. D.; Dukart, J.; Konrad, K.; Guenther, T.; Siniatchkin, M.; Fuchs, M.; von Polier, G.
Background
Diagnosis and treatment monitoring of attention-deficit/hyperactivity disorder (ADHD) largely rely on subjective assessments, highlighting the need for objective markers. Voice features and speech embeddings represent promising candidates for such markers, as they may capture alterations in speech production relevant to ADHD. However, it remains unclear which speech features are most informative for distinguishing ADHD and monitoring treatment effects, and which speech tasks most reliably elicit such differences.
Methods
Twenty-seven children with ADHD and 27 age-matched neurotypical controls completed six speech tasks across two study visits. Children with ADHD were unmedicated at baseline (first visit) and were assessed under prescribed methylphenidate treatment at follow-up, whereas controls underwent repeated assessment without intervention. Established acoustic voice features (eGeMAPS) and high-dimensional speech embeddings (WavLM, Whisper) were extracted and analysed using linear mixed models to examine baseline group differences and group-by-time interaction effects reflecting medication-associated change patterns.
Results
At baseline, children with ADHD differed significantly from controls in frequency, spectral, and temporal voice features, characterized by lower and more variable pitch, altered spectral properties, and reduced rhythmic stability. Group-by-time interaction effects indicated medication-associated modulation in the ADHD group, including reduced loudness variability and increased precision of vowel articulation at follow-up, changes not observed in controls. Speech embeddings revealed additional baseline and interaction effects beyond established acoustic features. Free speech tasks, particularly picture description, yielded the most robust and consistent effects.
Conclusion
Children with ADHD differed from neurotypical controls in vocal features at baseline and showed distinct longitudinal patterns consistent with medication-related change. These findings support further investigation of speech-based measures as candidate digital phenotypes and potential digital biomarkers in ADHD, with picture description emerging as a particularly promising task for future clinical assessment protocols.
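The group-by-time interaction tested above has a simple fixed-effect interpretation in a 2×2 design: the change in a voice feature from baseline to follow-up in the ADHD group, minus the corresponding change in controls. The sketch below shows only that contrast on group means, under hypothetical loudness-variability values; the study's linear mixed models additionally include random effects for the repeated measurements.

```python
def mean(xs):
    return sum(xs) / len(xs)

def interaction_contrast(adhd_base, adhd_follow, td_base, td_follow):
    """Group-by-time interaction as a difference-in-differences:
    (ADHD change over time) minus (control change over time)."""
    return (mean(adhd_follow) - mean(adhd_base)) - (mean(td_follow) - mean(td_base))
```

A negative contrast for loudness variability, for example, would correspond to the medication-associated reduction reported in the ADHD group but not in controls.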
Sharma, S.; Golden, R. M.; Montgomery, J. W.; Gillam, R. B.; Evans, J.
Because both monothetic and polythetic diagnostic classification approaches focus on the presence of individual symptoms to identify individuals in a clinical population, they may be diagnostically sensitive clinical markers of multidimensional disorders such as developmental language disorder (DLD). DLD researchers have also used likelihood ratios (LHs) to identify possible diagnostic clinical markers of DLD; however, the diagnostic sensitivity of LHs varies markedly across studies. A recent multidimensional computational elastic-net regression analysis examined a total of 71 measures of spoken language and cognitive processing from a cohort of 223 children ages 7;0 to 11;0 with and without DLD (DLD = 110; typically developing (TD) controls = 113). All 200 iterations of the model had high discriminative power (87%-88%) in positively identifying and distinguishing the DLD participants across all thresholds. Notably, the models identified a sparse DLD-specific deficit profile that included only nine of the 71 measures. In this study, we ask whether the individual LHs for each of these nine measures are equally sensitive in identifying and discriminating the children with DLD from TD controls, or whether diagnostic markers of multidimensional disorders such as DLD can only be identified through computational modeling approaches. The LHs for each of the nine measures were in the moderately high range (3.25-10). However, at the highest LH cut points for each measure, there was little to no overlap in the children each measure identified as having DLD. Follow-up analysis revealed that the elastic net model-derived predictive scores for each participant were significantly correlated with the participants' language ability. The model also identified a subgroup of TD participants as having the same DLD-deficit profile as the DLD participants.
This subgroup comprised younger, predominantly male participants whose standardized language assessment scores were lower compared with the larger TD cohort. Taken together, the results from this study show that, because multidimensional modeling approaches such as elastic net regression leverage the variability in deficit profiles across individual members of a diagnostic group and the unique contributions of each of the behavioral features of the phenotype, they may be an effective tool for deriving diagnostically specific deficit profiles for phenotypically complex, multicausal, multidimensional neurodevelopmental disorders such as DLD. The results also demonstrate the robustness of the derived DLD-specific deficit profile in identifying individuals with "mild" or subclinical DLD, demonstrating the potential utility of this approach in both clinical and research arenas.
What this paper adds
What is already known on this subject
The identification of diagnostic markers for DLD has been a challenge for both clinicians and researchers across multiple decades. Monothetic classification markers such as non-word repetition, optional infinitives, or syntax dependencies have been explored, as well as polythetic classification approaches in which a list of diagnostic symptoms is used together. However, each assumes different criteria and symptoms that should be included as diagnostic markers of DLD.
What this study adds
Our study assessed the feasibility and effectiveness of monothetic vs. polythetic classification approaches for identifying DLD. Because our prior work, which used elastic net logistic regression computational modeling with strong discriminatory power, consistently selected nine key features as the DLD-deficit profile, in this effort we calculated each of the nine features' likelihood ratios to examine each measure's ability to identify children with DLD.
The monothetic approach failed to identify a consistent set of children with DLD, and the polythetic classification approach also did not identify participants who were shown to have mild DLD by the elastic net modeling approach. Instead, our analysis showed that a computational modeling approach such as elastic net regression, which integrates small but important contributions from multiple cognitive and linguistic measures, could better capture multifaceted information about the disorder, better account for individual variability, and consistently identify most participants with DLD.
Clinical implications of this study
Elastic net logistic regression identifies a small subset of important features for distinguishing DLD and can assign a probability of DLD presence for each participant. Instead of the polythetic and monothetic approaches commonly used in the field, our study shows that integrating advanced computational modeling, such as elastic net regression, with clinician judgment can better refine assessment processes and address prior and ongoing inconsistencies in the DLD literature and diagnostic practices.
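The per-measure likelihood ratios discussed above are positive likelihood ratios at a chosen cut point: sensitivity divided by the false-positive rate. A minimal sketch, assuming higher scores indicate greater likelihood of DLD; the scores and cut point below are hypothetical, not the study's nine measures.

```python
def positive_lr(scores_dld, scores_td, cutoff):
    """LR+ = sensitivity / (1 - specificity) at a given cut point."""
    sens = sum(s >= cutoff for s in scores_dld) / len(scores_dld)   # true-positive rate
    fpr = sum(s >= cutoff for s in scores_td) / len(scores_td)      # false-positive rate
    return sens / fpr if fpr > 0 else float("inf")
```

An LR+ in the 3.25-10 band, as reported for the nine measures, indicates a moderately informative marker; the study's point is that two measures can share a similar LR+ yet flag largely non-overlapping sets of children.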
Radlowski Nova, J.; Lopez-Carbonero, J. I.; Corrochano, S.; Ayala, J. L.
Background
Mixed-format lifestyle questionnaires contain both structured variables and free-text responses, but it remains unclear whether language-derived variables provide incremental predictive value beyond structured data, and under which representational conditions. We investigated whether variables derived from patient-reported free text improve ALS-versus-control classification beyond structured questionnaire data, and whether their value depends on how temporal information is represented.
Methods
A leakage-free machine-learning pipeline was developed to classify ALS versus controls from questionnaire-derived data, including a schema-guided LLM-based text-to-table extraction and a compact longitudinal encoding strategy. Three feature configurations were compared: Pool1, containing structured baseline variables only; Pool2, adding compact summaries derived from first-time-point (T1) free-text responses; and Pool3, further incorporating compact descriptors of change between T1 and T2. Logistic regression, linear support vector classification, and random forest were evaluated using repeated stratified holdout (10 seeds) and repeated stratified 5-fold cross-validation. Final ablation analyses were performed to isolate the contributions of the compact text block and the compact temporal block.
Results
After leakage correction, performance estimates became more conservative, indicating that previous results had been optimistic. In the final configuration, Pool3 achieved the best performance, with random forest reaching a holdout accuracy of 0.673, an F1-weighted score of 0.666, and a Matthews correlation coefficient of 0.323; the cross-validated F1-weighted score and Matthews correlation coefficient were 0.654 and 0.312, respectively. Pool2 did not show a robust improvement over Pool1. Ablation analysis showed that removing the compact temporal block markedly reduced Pool3 performance, whereas removing the compact text block had little overall effect.
These findings indicate that the primary value of language-based processing in small clinical cohorts lies not in static feature enrichment but in enabling compact representations of longitudinal change.
Conclusions
In this setting, the main predictive gain did not arise from static text-derived variables alone, but from representing questionnaire information as compact longitudinal change descriptors. These findings suggest that, in small clinical cohorts, the value of language-based processing may lie more in summarizing trajectories than in expanding static feature spaces.
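The "compact descriptors of change" that drive Pool3 can be sketched as a baseline-plus-delta encoding of two questionnaire time points. This is an illustrative sketch only: the schema-guided LLM text-to-table step is not reproduced, and the variable names are hypothetical.

```python
def compact_longitudinal(t1, t2):
    """Encode two questionnaire time points as a baseline value plus
    compact change descriptors (delta and direction), per variable."""
    out = {}
    for var, v1 in t1.items():
        v2 = t2.get(var)
        if v2 is None:
            # variable missing at follow-up: keep baseline, flag the gap
            out[var] = {"baseline": v1, "delta": None, "direction": "missing"}
            continue
        d = v2 - v1
        out[var] = {
            "baseline": v1,
            "delta": d,
            "direction": "up" if d > 0 else "down" if d < 0 else "stable",
        }
    return out
```

In the ablation reported above, removing this temporal block hurt Pool3 markedly, whereas removing the static text summaries did not, which is the basis for the trajectory-over-enrichment conclusion.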
Nasir, R.; Chen, Y. R.; Morales Sierra, M.; Jacob, J.; Iyeke, L.; Jordan, L.; Paperwalla, K.; Richman, M.
Introduction
Sepsis is a life-threatening condition caused by an exaggerated immune response to infection and poses a major health problem, with increasing prevalence, high costs, and poor outcomes. Improved outcomes are seen when providers follow the Surviving Sepsis Campaign clinical practice guidelines for identifying and treating sepsis, which prescribe a 3-hour and a 6-hour bundle once sepsis is suspected. Previous research has shown that patients with mental health issues receive worse-quality diabetes and cardiac care and have poorer outcomes compared with those without mental health issues. Similarly, patients with mental health issues may receive worse sepsis care due to an inability to explain symptoms, agitation, and related factors. This study explores sepsis quality of care among patients with vs. without an acute mental health crisis, and whether patients with certain mental health issues were more likely to receive sepsis bundle care than others.
Methods
Using data extracted from 2018-2019 at the Long Island Jewish Medical Center Emergency Department (ED), patients who met sepsis inclusion criteria were grouped as having, or not having, a severe mental illness crisis on the basis of whether physical or chemical restraints were used in the ED. Patients with a history of severe mental illness who were not in a severe mental health crisis were grouped with the patients without mental illness, as, in the absence of an acute psychiatric problem, their mental health issue was unlikely to affect sepsis care. We describe the demographic characteristics of both groups and performed a univariate analysis using Student's t-test to compare the percent of those with vs. without an acute mental health crisis who received full 3- and 6-hour sepsis bundle care. Patients with an acute mental health crisis were grouped according to "cognitive" (e.g., dementia) vs. "non-cognitive" (e.g., schizophrenia) disorders.
Results
Comparing those with vs.
without an acute mental health crisis, there was no difference in the percent of patients who received 3-hour sepsis bundle care (80.7% vs. 74.9%, p = 0.1456). However, among patients who received the 3-hour bundle, a significantly greater percent of those with an acute mental health crisis received the 6-hour sepsis bundle (51.0% vs. 30.7%, p < 0.0001). There was no difference between groups of patients with mental health issues (e.g., "cognitive" vs. "non-cognitive") with respect to receiving 3- or 6-hour sepsis bundle care.
Discussion
Surprisingly, although there was no significant difference in the likelihood of receiving a 3-hour sepsis bundle among patients with vs. without an acute mental health crisis, those with an acute mental health crisis were more likely to receive 6-hour care. We suspect this difference might be due to increased attention paid to patients with an acute mental health crisis, including more frequent room visits by hospital staff or more concern among family members. No particular set of mental health conditions was associated with receiving or not receiving appropriate care. Future research could address possible confounding factors, examine in more detail the specific components of the sepsis protocol that patients failed to receive, and specify what aspects of a mental health crisis affected treatment plans. Future studies are needed to assess possible associations between severe mental illness crisis, bundle care, and mortality in relation to ED, Intensive Care Unit (ICU), or hospital length of stay (LOS).
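The abstract compares bundle-completion percentages with Student's t-test; the more conventional choice for comparing two proportions is a two-proportion z-test with a pooled variance, sketched below. The counts used in the usage test are hypothetical, since the subgroup denominators behind the reported 51.0% vs. 30.7% are not given here.

```python
import math

def two_proportion_z(x1, n1, x2, n2):
    """Two-sided two-proportion z-test with pooled variance.

    x1/n1, x2/n2 : successes and sample size in each group.
    Returns (z, p_value).
    """
    p1, p2 = x1 / n1, x2 / n2
    p = (x1 + x2) / (n1 + n2)                        # pooled proportion
    se = math.sqrt(p * (1 - p) * (1 / n1 + 1 / n2))  # pooled standard error
    z = (p1 - p2) / se
    # two-sided p-value from the standard normal CDF (via erf)
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p_value
```

With proportions as far apart as those reported for the 6-hour bundle and even moderate group sizes, this test yields p-values well below 0.0001, consistent with the direction of the reported result.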